Decentralized Heterogeneous Multi-Player Multi-Armed Bandits with Non-Zero Rewards on Collisions
Authors
Abstract
We consider a fully decentralized multi-player stochastic multi-armed bandit setting in which the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, i.e., the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow the colliding players to receive non-zero rewards. The time-horizon T for which the arms are played is not known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order O(log^{1+δ} T) for δ > 0 (however small) over a time-horizon of duration T.
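As a minimal illustration of the setting described in the abstract (not the paper's policy), the sketch below simulates decentralized players who each run a standard UCB1 rule independently, with player-specific (heterogeneous) arm means and a non-zero reward on collisions. The number of players, arm means, and collision scaling factor are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: N players, K arms, horizon T.
N, K, T = 3, 5, 2000
means = rng.uniform(0.2, 0.9, size=(N, K))  # heterogeneous: per-player arm means
collision_factor = 0.5                      # assumed non-zero reward scale on collision

counts = np.zeros((N, K))  # pulls per (player, arm)
sums = np.zeros((N, K))    # accumulated reward per (player, arm)

for t in range(1, T + 1):
    choices = np.empty(N, dtype=int)
    for p in range(N):
        # Each player computes its UCB1 index from its own observations only
        # (fully decentralized: no communication between players).
        if np.any(counts[p] == 0):
            choices[p] = int(np.argmin(counts[p]))  # play each arm once first
        else:
            ucb = sums[p] / counts[p] + np.sqrt(2 * np.log(t) / counts[p])
            choices[p] = int(np.argmax(ucb))
    # Colliding players still receive a (scaled-down) non-zero reward.
    occupancy = np.bincount(choices, minlength=K)
    for p in range(N):
        a = choices[p]
        base = rng.binomial(1, means[p, a])
        r = base * (collision_factor if occupancy[a] > 1 else 1.0)
        counts[p, a] += 1
        sums[p, a] += r
```

This is only a sketch of the environment dynamics; the paper's policy additionally handles an unknown horizon and more players than arms, which plain per-player UCB1 does not address.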
Similar resources
Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits
In this paper, we introduce a multi-agent multi-armed bandit-based model for ad hoc teamwork with expensive communication. The goal of the team is to maximize the total reward gained from pulling arms of a bandit over a number of epochs. In each epoch, each agent decides whether to pull an arm and hence collect a reward, or to broadcast the reward it obtained in the previous epoch to the team a...
Trading off Rewards and Errors in Multi-Armed Bandits
In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather gene...
On Kernelized Multi-armed Bandits
We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization – Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward...
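As a rough sketch of the GP-UCB family this snippet refers to (not the paper's IGP-UCB or GP-TS specifically), the code below runs a basic Gaussian-process upper-confidence rule over a discretized one-dimensional arm set. The reward function, RBF kernel, lengthscale, noise level, and confidence parameter beta are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.linspace(0.0, 1.0, 50)   # discretized candidate arms
noise = 0.1                     # assumed observation noise std

def rbf(a, b, ls=0.2):
    # RBF (squared-exponential) kernel with assumed lengthscale ls.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

xs, ys = [], []
for t in range(30):
    if not xs:
        mu = np.zeros_like(X)       # prior mean
        sd = np.ones_like(X)        # prior std (kernel has unit variance)
    else:
        Xo = np.array(xs)
        K = rbf(Xo, Xo) + noise ** 2 * np.eye(len(xs))
        k = rbf(X, Xo)
        mu = k @ np.linalg.solve(K, np.array(ys))   # posterior mean
        var = 1.0 - np.einsum('ij,ji->i', k, np.linalg.solve(K, k.T))
        sd = np.sqrt(np.clip(var, 0.0, None))       # posterior std
    beta = 2.0                                      # assumed confidence width
    x = float(X[int(np.argmax(mu + beta * sd))])    # upper-confidence choice
    xs.append(x)
    # Assumed unknown reward function: sin(3x) plus Gaussian noise.
    ys.append(np.sin(3.0 * x) + noise * rng.standard_normal())
```

The refinements in IGP-UCB and GP-TS concern, among other things, how the confidence width is set; a fixed beta as above is only a simplification.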
Multi-Armed Bandits with Betting
In this paper we consider an extension where the gambler has, at each round, K coins available for play, and the slot machines accept bets. If the player bets m coins on a machine, then the machine will return m times the payoff of that round. It is important to note that betting m coins on a machine results in obtaining a single sample from the rewards distribution of that machine (multiplied ...
Contextual Multi-Armed Bandits
We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions ...
Journal
Journal title: IEEE Transactions on Information Theory
Year: 2022
ISSN: 0018-9448, 1557-9654
DOI: https://doi.org/10.1109/tit.2021.3136095